
    Latency Analysis of Coded Computation Schemes over Wireless Networks

    Large-scale distributed computing systems face two major bottlenecks that limit their scalability: straggler delay, caused by the variability of computation times at different worker nodes, and communication bottlenecks, caused by shuffling data across many nodes in the network. Recently, it has been shown that codes can provide significant gains in overcoming these bottlenecks. In particular, optimal coding schemes for minimizing latency in distributed computation of linear functions and mitigating the effect of stragglers were proposed for a wired network, where the workers can simultaneously transmit messages to a master node without interference. In this paper, we focus on the problem of coded computation over a wireless master-worker setup with straggling workers, where only one worker can transmit the result of its local computation back to the master at a time. We consider three asymptotic regimes (determined by how the communication and computation times scale with the number of workers) and precisely characterize the total run-time of the distributed algorithm and the optimal coding strategy in each regime. In particular, for the regime of practical interest where the computation and communication times of the distributed computing algorithm are comparable, we show that the total run-time approaches a simple lower bound that decouples computation and communication, and we demonstrate that coded schemes are Θ(log(n)) times faster than uncoded schemes.
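
    As a concrete illustration of the coded schemes discussed above, the following minimal sketch simulates (n, k) MDS-coded distributed matrix-vector multiplication: the results from any k of the n workers suffice to recover Ax, so the master can ignore the n − k slowest stragglers. This is a generic sketch under assumed parameters, using a real-valued Vandermonde generator; it is not the paper's exact scheme, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4                        # n workers; any k results suffice
q, d = 8, 5                        # A is q x d with q divisible by k
A, x = rng.standard_normal((q, d)), rng.standard_normal(d)

# Master encodes the k row blocks of A with an n x k Vandermonde
# generator; every k x k submatrix of G is invertible (MDS property)
blocks = np.split(A, k)
G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# Each worker i computes its coded product; suppose only these k
# workers finish before the deadline while the rest straggle
finished = [0, 2, 3, 5]
results = np.stack([coded[i] @ x for i in finished])

# Master decodes by inverting the corresponding k x k submatrix of G
decoded = np.linalg.solve(G[finished], results)
assert np.allclose(np.concatenate(decoded), A @ x)
```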

    EM for Mixture of Linear Regression with Clustered Data

    Modern data-driven and distributed learning frameworks deal with diverse massive data generated by clients spread across heterogeneous environments. Indeed, data heterogeneity is a major bottleneck in scaling up many distributed learning paradigms. In many settings, however, heterogeneous data may be generated in clusters with shared structures, as is the case in several applications such as federated learning, where a common latent variable governs the distribution of all the samples generated by a client. It is therefore natural to ask how the underlying clustered structures in distributed data can be exploited to improve learning schemes. In this paper, we tackle this question in the special case of estimating the d-dimensional parameters of a two-component mixture of linear regressions problem, where each of m nodes generates n samples with a shared latent variable. We employ the well-known Expectation-Maximization (EM) method to estimate the maximum likelihood parameters from m batches of dependent samples, each containing n measurements. Discarding the clustered structure in the mixture model, EM is known to require O(log(mn/d)) iterations to reach the statistical accuracy of O(√(d/(mn))). In contrast, we show that if initialized properly, EM on the structured data requires only O(1) iterations to reach the same statistical accuracy, as long as m grows as e^{o(n)}. Our analysis establishes and combines novel asymptotic optimization and generalization guarantees for population and empirical EM with dependent samples, which may be of independent interest.
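
    The following minimal sketch (assumed parameter values; not the paper's exact estimator) runs EM for a symmetric two-component mixture of linear regressions in which one latent sign z_i ∈ {−1, +1} is shared by all n samples of batch i. Because the E-step pools the whole batch, the posterior of z_i concentrates exponentially fast in n, which is the intuition behind the O(1) iteration count.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, d, sigma = 200, 20, 5, 0.5   # batches, samples per batch, dim, noise
beta_true = rng.standard_normal(d)

# Batch i draws a single latent sign z_i shared by all n of its samples:
# y_it = z_i * <beta_true, x_it> + noise
X = rng.standard_normal((m, n, d))
z = rng.choice([-1.0, 1.0], size=m)
Y = z[:, None] * (X @ beta_true) + sigma * rng.standard_normal((m, n))

A = np.einsum('mnd,mne->de', X, X)   # Gram matrix, fixed across iterations
beta = beta_true + 0.5 * rng.standard_normal(d)   # proper initialization
for it in range(5):
    # E-step: the posterior mean of z_i aggregates the entire batch,
    # so it concentrates exponentially fast in n
    w = np.tanh((Y * (X @ beta)).sum(axis=1) / sigma**2)
    # M-step: least squares against the responsibility-weighted labels
    b = np.einsum('m,mn,mnd->d', w, Y, X)
    beta = np.linalg.solve(A, b)
    print(it, np.linalg.norm(beta - beta_true))
```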

    Robust and Communication-Efficient Collaborative Learning

    We consider a decentralized learning problem, where a set of computing nodes aims at solving a non-convex optimization problem collaboratively. It is well known that decentralized optimization schemes face two major system bottlenecks: stragglers' delay and communication overhead. In this paper, we tackle these bottlenecks by proposing a novel decentralized and gradient-based optimization algorithm named QuanTimed-DSGD. Our algorithm stands on two main ideas: (i) we impose a deadline on the local gradient computations of each node at each iteration of the algorithm, and (ii) the nodes exchange quantized versions of their local models. The first idea makes the algorithm robust to straggling nodes, and the second reduces the communication overhead. The key technical contribution of our work is to prove that, with non-vanishing noise from quantization and stochastic gradients, the proposed method converges exactly to the global optimum for convex loss functions and finds a first-order stationary point in non-convex scenarios. Our numerical evaluations of QuanTimed-DSGD on training benchmark datasets, MNIST and CIFAR-10, demonstrate speedups of up to 3x in run-time compared to state-of-the-art decentralized optimization methods.
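
    The sketch below illustrates the two ingredients on a toy quadratic problem over a ring of nodes: a per-iteration deadline, simulated here by a random number of stochastic gradients finished in time, and exchange of quantized models via unbiased stochastic rounding. The update rule, topology, and constants are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, eta, step = 8, 10, 0.05, 0.05   # nodes, dim, base lr, quantizer grid

def quantize(v):
    """Unbiased stochastic rounding of v onto a grid of resolution `step`."""
    low = np.floor(v / step)
    return step * (low + (rng.random(v.shape) < v / step - low))

# Doubly stochastic mixing matrix for a ring: average self + two neighbors
W = np.zeros((N, N))
for i in range(N):
    W[i, [i, (i - 1) % N, (i + 1) % N]] = 1.0 / 3.0

# Toy local objectives f_i(x) = 0.5 * ||x - c_i||^2; the global optimum
# of (1/N) * sum_i f_i is the mean of the c_i
C = rng.standard_normal((N, d))
X = rng.standard_normal((N, d))       # one local model per node

for t in range(500):
    Q = quantize(X)                   # nodes broadcast quantized models
    G = np.empty_like(X)
    for i in range(N):
        # Deadline: node i averages only the b gradients it finished in
        # time (simulated by a random batch size), so stragglers cannot
        # stall the iteration -- they just return noisier gradients
        b = int(rng.integers(1, 9))
        G[i] = (X[i] - C[i]) + rng.standard_normal((b, d)).mean(axis=0)
    X = W @ Q - (eta / np.sqrt(t + 1)) * G   # mix, then take a local step

print(np.abs(X - C.mean(axis=0)).max())      # all nodes near the optimum
```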

    Variance-reduced Clipping for Non-convex Optimization

    Gradient clipping is a standard training technique used in deep learning applications, such as large-scale language modeling, to mitigate exploding gradients. Recent experimental studies have demonstrated a distinctive behavior of the smoothness of the training objective along its trajectory when trained with gradient clipping: the smoothness grows with the gradient norm. This is in clear contrast to the well-established assumption in folklore non-convex optimization, a.k.a. L-smoothness, where the smoothness is assumed to be bounded by a constant L globally. The recently introduced (L_0, L_1)-smoothness is a more relaxed notion that captures such behavior in non-convex optimization. In particular, it has been shown that under this relaxed smoothness assumption, SGD with clipping requires O(ε^{-4}) stochastic gradient computations to find an ε-stationary solution. In this paper, we employ a variance reduction technique, namely SPIDER, and demonstrate that for a carefully designed learning rate, this complexity is improved to O(ε^{-3}), which is order-optimal. Our designed learning rate incorporates the clipping technique to mitigate the growing smoothness. Moreover, when the objective function is the average of n components, we improve the existing O(nε^{-2}) bound on the stochastic gradient complexity to O(√n ε^{-2} + n), which is order-optimal as well. In addition to being theoretically optimal, SPIDER with our designed parameters demonstrates comparable empirical performance against variance-reduced methods such as SVRG and SARAH in several vision tasks.
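
    To make the variance-reduction-plus-clipping recipe concrete, here is a minimal sketch on a toy finite-sum least-squares problem: SPIDER maintains a recursive gradient estimator with a periodic full-gradient refresh, and the step size min(η, γ/‖v‖) clips the move when the estimated gradient is large. The constants and the exact learning-rate rule are illustrative assumptions, not the paper's designed schedule.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 10
A, y = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad(x, idx):
    """Average gradient of 0.5 * (a_i . x - y_i)^2 over the index set idx."""
    r = A[idx] @ x - y[idx]
    return A[idx].T @ r / len(idx)

x, x_prev = np.zeros(d), np.zeros(d)
q, b = 20, 32                       # refresh period and minibatch size
eta, gamma = 0.1, 1.0               # base step size and clipping threshold
for t in range(200):
    if t % q == 0:
        v = grad(x, np.arange(n))   # periodic full-gradient refresh
    else:
        idx = rng.choice(n, b, replace=False)
        v = grad(x, idx) - grad(x_prev, idx) + v    # SPIDER recursion
    x_prev = x
    # Clipped step: min(eta, gamma / ||v||) shrinks the move where the
    # estimated gradient (and hence the local smoothness) is large
    x = x - min(eta, gamma / (np.linalg.norm(v) + 1e-12)) * v

print(np.linalg.norm(grad(x, np.arange(n))))        # final gradient norm
```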

    Robust Federated Learning: The Case of Affine Distribution Shifts

    Federated learning is a distributed paradigm that aims at training models using samples distributed across multiple users in a network, while keeping the samples on users' devices, with the aims of efficiency and protecting users' privacy. In such settings, the training data is often statistically heterogeneous and manifests various distribution shifts across users, which degrades the performance of the learnt model. The primary goal of this paper is to develop a robust federated learning algorithm that achieves satisfactory performance against distribution shifts in users' samples. To achieve this goal, we first consider a structured affine distribution shift in users' data that captures the device-dependent data heterogeneity in federated settings. This perturbation model is applicable to various federated learning problems such as image classification, where the images undergo device-dependent imperfections, e.g. different intensity, contrast, and brightness. To address affine distribution shifts across users, we propose a Federated Learning framework Robust to Affine distribution shifts (FLRA) that is provably robust against affine Wasserstein shifts to the distribution of observed samples. To solve FLRA's distributed minimax problem, we propose a fast and efficient optimization method and provide convergence guarantees via a Gradient Descent Ascent (GDA) method. We further prove generalization error bounds for the learnt classifier to show proper generalization from the empirical distribution of samples to the true underlying distribution. We perform several numerical experiments to empirically support FLRA. We show that an affine distribution shift indeed suffices to significantly decrease the performance of the learnt classifier on a new test user, and that our proposed algorithm achieves a significant gain in comparison to standard federated learning and adversarial training methods.
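
    The minimax idea can be sketched as follows: each user i's inputs are perturbed by an affine map x ↦ Λ_i x + δ_i that gradient ascent pushes toward the worst case inside a small ball, while gradient descent updates the shared model, i.e., a plain GDA loop. The toy least-squares loss, the projection radii, and all step sizes are illustrative assumptions, not FLRA's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, d = 10, 50, 5                 # users, samples per user, dimension
Xs = rng.standard_normal((m, n, d))
theta_true = rng.standard_normal(d)
Ys = Xs @ theta_true + 0.1 * rng.standard_normal((m, n))

theta = np.zeros(d)                 # shared model (minimizing player)
Lam = np.stack([np.eye(d)] * m)     # per-user shift: x -> Lam_i x + delta_i
delta = np.zeros((m, d))            # (maximizing players)
eta_d, eta_a, eps = 0.05, 0.01, 0.3 # descent/ascent steps, shift radius

def proj(L, dl):
    """Project an affine shift back into an eps-ball around the identity."""
    D = L - np.eye(d)
    if np.linalg.norm(D) > eps:
        D *= eps / np.linalg.norm(D)
    if np.linalg.norm(dl) > eps:
        dl = dl * (eps / np.linalg.norm(dl))
    return np.eye(d) + D, dl

for t in range(300):
    g_theta = np.zeros(d)
    for i in range(m):
        Z = Xs[i] @ Lam[i].T + delta[i]     # user i's perturbed inputs
        r = Z @ theta - Ys[i]               # residuals, shape (n,)
        # Ascent: push the user's shift toward the worst case, then project
        gL = np.einsum('n,j,nk->jk', r, theta, Xs[i]) / n
        Lam[i], delta[i] = proj(Lam[i] + eta_a * gL,
                                delta[i] + eta_a * r.mean() * theta)
        g_theta += Z.T @ r / n              # descent direction from user i
    theta -= eta_d * g_theta / m            # update the shared model

print(np.linalg.norm(theta - theta_true))   # robust model vs. ground truth
```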